Mining Frequent Patterns, Associations, & Correlations 



• Motivation: Business transaction records 

- Discovery of interesting correlation relationships that help business decision-making
processes (catalog design, cross-marketing, customer shopping behavior analysis, ...)

Market basket analysis 

• How to place SW, HW, and Accessories? 


• Frequent patterns are item sets that appear frequently in a dataset (e.g. transaction 
records) 


• Support and Confidence are measures of rule interestingness 


- 2% support -> 2% of transactions show that computers and AV_SW are bought together

- 60% confidence -> 60% of customers who bought a computer also bought AV_SW




The Basics 

Frequent Itemsets 







$$\text{support}(A \Rightarrow B) = P(A \cup B) = \frac{n(A \cup B)}{N}$$



Support is the fraction of all N transactions that contain every item of both A and B.

Example: support(banana => pineapple) = P(banana ∪ pineapple) = 4/7 ≈ 57%



$$\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{n(A \cup B)}{n(A)}$$



Confidence is the fraction of the transactions containing A that also contain B.

Example: confidence(banana => pineapple) = P(pineapple | banana) = 4/5 = 80%


• If a rule satisfies min_support and min_confidence thresholds, it is said to be strong 
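The two measures and the "strong rule" test can be sketched in a few lines of Python. The notes do not list the transactions behind the banana/pineapple figures, so the `BASKETS` list below is a hypothetical dataset chosen only to reproduce 4/7 and 4/5. The function names and the threshold values passed to `is_strong` are likewise assumptions of this sketch.

```python
# Hypothetical baskets (not from the notes), chosen so that 4 of the 7
# contain both fruits and 5 of the 7 contain banana.
BASKETS = [
    {"banana", "pineapple"},
    {"banana", "pineapple", "apple"},
    {"banana", "pineapple", "mango"},
    {"banana", "pineapple"},
    {"banana", "apple"},
    {"apple", "mango"},
    {"mango"},
]


def count(itemset, transactions):
    """n(itemset): number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(a, b, transactions):
    """support(A => B) = n(A u B) / N."""
    return count(a | b, transactions) / len(transactions)


def confidence(a, b, transactions):
    """confidence(A => B) = n(A u B) / n(A)."""
    return count(a | b, transactions) / count(a, transactions)


def is_strong(a, b, transactions, min_support, min_confidence):
    """A rule is strong when it meets both thresholds."""
    return (support(a, b, transactions) >= min_support
            and confidence(a, b, transactions) >= min_confidence)


A, B = {"banana"}, {"pineapple"}
print(f"support    = {support(A, B, BASKETS):.0%}")
print(f"confidence = {confidence(A, B, BASKETS):.0%}")
# Example thresholds (assumed): 50% support, 70% confidence.
print(is_strong(A, B, BASKETS, min_support=0.5, min_confidence=0.7))
```

Raising `min_support` above 4/7 would make the same rule fail the test even though its confidence is unchanged.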




• The problem of mining association rules reduces to mining frequent itemsets

• Association rules mining becomes a two-step process: 

1. Find all frequent itemsets: those whose support count is >= a predetermined
min_support count

2. Generate strong association rules from the frequent itemsets that satisfy 
min_support and min_confidence 

• If min_support count is set too low -> huge # of frequent itemsets 


Mining Frequent Itemsets 
1) Apriori Algorithm 

To improve efficiency, use the Apriori property:

"All nonempty subsets of a frequent itemset must also be frequent." Equivalently, if a set
cannot pass a test, all of its supersets will fail the same test as well:

if P(I) < min_support then P(I ∪ A) < min_support
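The property holds because adding an item to an itemset can only shrink the set of transactions that contain it. A minimal brute-force check of this, on a small hypothetical dataset (the items A to E and `MIN_COUNT` are illustrative assumptions, not data from the notes):

```python
from itertools import combinations

# Hypothetical transactions (not from the notes).
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
MIN_COUNT = 2  # assumed threshold for the illustration


def count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def supersets_never_gain_support(transactions):
    """Check count(I u {x}) <= count(I) for every itemset I and extra item x."""
    items = sorted(set().union(*transactions))
    for k in range(1, len(items)):
        for subset in combinations(items, k):
            for extra in items:
                if extra in subset:
                    continue
                bigger = subset + (extra,)
                if count(bigger, transactions) > count(subset, transactions):
                    return False
    return True


print(supersets_never_gain_support(TRANSACTIONS))
# {D} is below MIN_COUNT, so no superset of {D} can reach it either.
print(count({"D"}, TRANSACTIONS), count({"A", "D"}, TRANSACTIONS))
```

This is why Apriori can discard a candidate as soon as one of its subsets is known to be infrequent, without scanning the dataset for it.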



Level 1 



Transactional data example

N = 9, min_supp count = 2

TID     List of items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

Scan dataset for count of each candidate -> C1

Itemset    Support count
{I1}       6
{I2}       7
{I3}       6
{I4}       2
{I5}       2

Compare candidate support with min_supp -> L1

Itemset    Support count
{I1}       6
{I2}       7
{I3}       6
{I4}       2
{I5}       2

The support count of an item is the number of transactions that contain it. I1, for
example, appears in six transactions (T100, T400, T500, T700, T800, T900). Every
candidate reaches min_supp count = 2, so L1 contains all five items.
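Level 1 amounts to one counting pass followed by a filter. A minimal sketch on the nine transactions above (the variable names are assumptions of this sketch):

```python
from collections import Counter

# T100 ... T900 from the transactional data example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPP_COUNT = 2

# C1: scan the dataset once and count every single item.
c1 = Counter(item for t in TRANSACTIONS for item in t)

# L1: keep the candidates whose count reaches the threshold.
l1 = {item: n for item, n in c1.items() if n >= MIN_SUPP_COUNT}

for item in sorted(l1):
    print(item, l1[item])
```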

Level 2 



Generate C2 candidates from L1 by joining L1 ⋈ L1

Scan dataset for count of each candidate -> C2

Itemset     Support count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

Compare candidate support with min_supp -> L2

Itemset     Support count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
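Level 2 can be sketched the same way: pair up the items of L1, count each pair with one scan, and filter. The names `c2` and `l2` are assumptions of this sketch.

```python
from itertools import combinations

# T100 ... T900 from the transactional data example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPP_COUNT = 2
L1_ITEMS = ["I1", "I2", "I3", "I4", "I5"]  # all five items were frequent

# C2: joining L1 with itself yields every pair of frequent items.
c2 = {
    pair: sum(1 for t in TRANSACTIONS if set(pair) <= t)
    for pair in combinations(L1_ITEMS, 2)
}

# L2: pairs whose support count reaches the threshold.
l2 = {pair: n for pair, n in c2.items() if n >= MIN_SUPP_COUNT}

for pair, n in c2.items():
    print(pair, n, "frequent" if pair in l2 else "dropped")
```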
Level 3 



Generate C3 candidates by joining L2 ⋈ L2. Two itemsets of L2 are joined only when they
share the same first item: {I1, I2} and {I1, I3}, for example, join into {I1, I2, I3}.

Join result:
{{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}

Prune:
- {I1, I2, I3} is kept: {I1, I2}, {I1, I3} and {I2, I3} are all frequent.
- {I1, I2, I5} is kept: {I1, I2}, {I1, I5} and {I2, I5} are all frequent.
- {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5} are pruned: not all subsets
  are frequent. In {I1, I3, I5}, for example, {I1, I3} and {I1, I5} are frequent but
  {I3, I5} is not.

Scan dataset for count of each candidate -> C3, compare with min_supp -> L3

Itemset         Support count
{I1, I2, I3}    2
{I1, I2, I5}    2
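The join and prune steps can be sketched as two small functions. Itemsets are kept as sorted tuples so that "share the same first k-1 items" is a simple slice comparison. The function names are assumptions of this sketch.

```python
from itertools import combinations

# L2 from the level-2 result, as sorted tuples.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]


def join(prev):
    """Join k-itemsets that agree on their first k-1 items."""
    prev = sorted(prev)
    out = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                out.append(a + (b[-1],))
    return out


def prune(candidates, prev):
    """Drop candidates that have an infrequent (k-1)-item subset."""
    prev_set = set(prev)
    return [
        c for c in candidates
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    ]


joined = join(L2)
c3 = prune(joined, L2)
print(joined)
print(c3)
```

Pruning needs no dataset scan: it only consults L2, which is the whole point of the Apriori property.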

Level 4

Join L3 ⋈ L3 -> candidate {I1, I2, I3, I5}

Not all subsets are frequent ({I2, I3, I5} is not in L3) -> Prune

C4 = ∅ -> Terminate

Level 3 is therefore the last level that produces frequent itemsets:

Itemset
{I1, I2, I3}
{I1, I2, I5}
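Putting the four levels together, the level-wise loop can be sketched as below. This is a compact illustration under the assumptions already used above (itemsets as sorted tuples, support as an absolute count), not a tuned implementation. The function name `apriori` is an assumption of this sketch.

```python
from itertools import combinations

# T100 ... T900 from the transactional data example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def apriori(transactions, min_count):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]          # C1
    frequent = {}
    while candidates:                           # stop when Ck is empty
        # Scan: count each candidate, then keep those reaching min_count (Lk).
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join Lk with itself, pruning candidates with an infrequent subset.
        prev = sorted(level)
        prev_set = set(prev)
        candidates = []
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:-1] != b[:-1]:
                    continue
                cand = a + (b[-1],)
                if all(s in prev_set
                       for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return frequent


result = apriori(TRANSACTIONS, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

On this dataset the loop ends at level 4, where the only joined candidate is pruned and the candidate list becomes empty.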

Generating Association Rules from Frequent Itemsets 

Association rules can be generated using the confidence equation, as follows: 

$$\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{support\_count}(A \cup B)}{\text{support\_count}(A)}$$

1. For each frequent itemset l, generate all nonempty subsets of l

2. For every nonempty proper subset s of l, output the rule:

s => (l - s)   if   support_count(l) / support_count(s) >= min_confidence



Example: take the frequent itemset l = {I1, I2, I5}. Each nonempty proper subset becomes
the left-hand side of a rule, and the remaining items of l become the right-hand side.
The subset {I1, I2}, for instance, gives the rule {I1, I2} => I5.

Nonempty subsets    Association rules
{I1, I2}            {I1, I2} => I5
{I1, I5}            {I1, I5} => I2
{I2, I5}            {I2, I5} => I1
{I1}                {I1} => {I2, I5}
{I2}                {I2} => {I1, I5}
{I5}                {I5} => {I1, I2}

To compute the confidence of {I1, I2} => I5, count transactions: {I1, I2, I5} appears in
2 transactions (T100, T800) and {I1, I2} appears in 4 (T100, T400, T800, T900).

confidence({I1, I2} => I5) = P(I5 | {I1, I2}) = 2/4 = 50%



Compute the confidence of every rule in the same way, then keep only the rules whose
confidence reaches min_confidence.

min_confidence = 70%

Association rule      Confidence
{I1, I2} => I5        2/4 = 50%
{I1, I5} => I2        2/2 = 100%
{I2, I5} => I1        2/2 = 100%
{I1} => {I2, I5}      2/6 = 33%
{I2} => {I1, I5}      2/7 = 29%
{I5} => {I1, I2}      2/2 = 100%

Strong rules (confidence >= 70%):

{I1, I5} => I2        2/2 = 100%
{I2, I5} => I1        2/2 = 100%
{I5} => {I1, I2}      2/2 = 100%
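The rule-generation procedure for one frequent itemset can be sketched as follows. The function name `rules_from` and the tuple layout of its result are assumptions of this sketch. The data and the 70% threshold are the ones used in the example.

```python
from itertools import combinations

# T100 ... T900 from the transactional data example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def support_count(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= t)


def rules_from(itemset, transactions, min_confidence):
    """Yield (antecedent, consequent, confidence) for every strong rule."""
    items = sorted(itemset)
    whole = support_count(items, transactions)
    strong = []
    for k in range(1, len(items)):              # proper subsets only
        for s in combinations(items, k):
            conf = whole / support_count(s, transactions)
            if conf >= min_confidence:
                rest = tuple(i for i in items if i not in s)
                strong.append((s, rest, conf))
    return strong


for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, TRANSACTIONS, 0.70):
    print(f"{set(lhs)} => {set(rhs)}  confidence = {conf:.0%}")
```

No extra dataset scan is strictly needed in practice, since every subset of a frequent itemset was already counted during mining. The sketch recounts only to stay self-contained.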






2) FP-Growth 

• To avoid costly candidate generation 



• Divide-and-conquer strategy: 






1. Compress database representing frequent items into a frequent pattern 
tree (FP-tree) - 2 passes over dataset 

2. Divide compressed database (FP-tree) into conditional databases, then mine 
each for frequent itemsets - traverse through the FP-tree 






Transactional data example
N = 9, min_supp count = 2

TID     List of items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

Scan dataset for count of each candidate (C1), compare candidate support with min_supp
(L1), then reorder L1 by descending support count.

C1                          L1 - Reordered
Itemset    Support count    Itemset    Support count
{I1}       6                {I2}       7
{I2}       7                {I1}       6
{I3}       6                {I3}       6
{I4}       2                {I4}       2
{I5}       2                {I5}       2

I1, for example, appears in 6 transactions (T100, T400, T500, T700, T800, T900). All five
items reach min_supp count, so none is dropped.

The tree starts from a root labeled null{}. Each transaction is inserted as a path from
the root, with its items sorted in L1 order. T100 contains I1, I2, I5, which sort to
I2, I1, I5, so the first branch is null{} -> I2:1 -> I1:1 -> I5:1.




T200 contains I2, I4. I2 is already a child of the root, so its count is incremented to
I2:2 and I4:1 is added below it as a new child. The remaining transactions are inserted in
the same way, which gives the final tree:

null{}
    I2:7
        I1:4
            I5:1
            I4:1
            I3:2
                I5:1
        I4:1
        I3:2
    I1:2
        I3:2

L1 - Reordered (header table)

Itemset    Support count    Node link
{I2}       7                -> I2:7
{I1}       6                -> I1:4 -> I1:2
{I3}       6                -> I3:2 -> I3:2 -> I3:2
{I4}       2                -> I4:1 -> I4:1
{I5}       2                -> I5:1 -> I5:1

For tree traversal, each item in the header table has a node link that chains together all
nodes of that item in the tree. Follow an item's node links, add up the node counts, and
you get that item's support count.



Bottom-up algorithm: start from the leaves and go up to the root. I5, for example, has two
paths to the root.
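The two passes that build the FP-tree, and the node links used to walk it bottom-up, can be sketched as below. The `Node` class, the `links` dictionary standing in for the header table, and the tie-break by item name for equal counts are assumptions of this sketch. The tie-break happens to reproduce the I2, I1, I3, I4, I5 order of the notes.

```python
from collections import Counter

# T100 ... T900 from the transactional data example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Pass 1: count items, keep the frequent ones, order by descending count.
    counts = Counter(i for t in transactions for i in t)
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))  # name breaks ties (assumed)
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    links = {item: [] for item in order}  # header table: item -> its nodes
    # Pass 2: insert each transaction as a path of its reordered items.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                links[item].append(child)
            child.count += 1
            node = child
    return root, links, order


def prefix_path(node):
    """Items on the way from the root down to (but excluding) node."""
    path = []
    node = node.parent
    while node.item is not None:
        path.append(node.item)
        node = node.parent
    return path[::-1]


root, links, order = build_fp_tree(TRANSACTIONS, min_count=2)
print(order)
for n in links["I5"]:
    print(prefix_path(n), n.count)
```

Summing `n.count` over `links[item]` gives back the item's support count, which is the header-table check described above.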









Conditional FP-tree construction, taking the node I5 as an example:

1. Eliminate transactions not including I5. I5 occurs in T100 and T800, so only the two
   paths that end in I5 remain.
2. Eliminate I5 itself. The remaining prefix paths are (I2, I1):1 from T100 and
   (I2, I1, I3):1 from T800.
3. I2 and I1 lie on both paths, so each has count 2. I3 lies only on the T800 path, so its
   count is 1, which is below min support, and it is dropped.



For each item, starting from the bottom of the header table:

1) Conditional pattern base: the paths to which the item is a suffix

2) Conditional FP-tree: the prefix paths to the item after eliminating infrequent items

3) Frequent patterns generated

Item    Conditional Pattern Base               Conditional FP-tree       Frequent Patterns Generated
I5      {{I2, I1: 1}, {I2, I1, I3: 1}}         ⟨I2: 2, I1: 2⟩            {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4      {{I2, I1: 1}, {I2: 1}}                 ⟨I2: 2⟩                   {I2, I4: 2}
I3      {{I2, I1: 2}, {I2: 2}, {I1: 2}}        ⟨I2: 4, I1: 2⟩, ⟨I1: 2⟩   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1      {{I2: 4}}                              ⟨I2: 4⟩                   {I2, I1: 4}
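The three columns of the table can be reproduced with a short recursive sketch. To stay compact it stores each conditional database as a plain list of (prefix path, count) pairs instead of a linked FP-tree. That simplification, and the names `conditional_pattern_base` and `mine`, are assumptions of this sketch rather than the notes' data structure.

```python
from collections import Counter

ORDER = ["I2", "I1", "I3", "I4", "I5"]  # L1 reordered by descending support
# T100 ... T900 from the transactional data example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
# Each transaction becomes one path of reordered items with count 1.
PATHS = [(tuple(i for i in ORDER if i in t), 1) for t in TRANSACTIONS]


def conditional_pattern_base(item, paths):
    """Prefix paths of `item`, with the counts of identical prefixes merged."""
    base = Counter()
    for path, n in paths:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                base[prefix] += n
    return base


def mine(paths, suffix, min_count, out):
    """Grow patterns that end in `suffix` from a conditional database."""
    counts = Counter()
    for path, n in paths:
        for item in path:
            counts[item] += n
    for item, n in counts.items():
        if n < min_count:
            continue                      # eliminate infrequent items
        pattern = (item,) + suffix
        out[frozenset(pattern)] = n
        base = conditional_pattern_base(item, paths)
        mine(list(base.items()), pattern, min_count, out)


patterns = {}
mine(PATHS, (), 2, patterns)
print(dict(conditional_pattern_base("I3", PATHS)))
for p in sorted(patterns, key=lambda s: (len(s), sorted(s))):
    print(sorted(p), patterns[p])
```

The result contains the five single items plus the eight multi-item patterns of the table, the same 13 frequent itemsets Apriori finds, but without generating candidates.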



Pattern Evaluation Methods 

• Not all association rules are interesting

- buys(X, "computer games") => buys(X, "videos") [support = 40%, confidence = 66%]

- P("videos") is already 75% > 66%

- The two items are negatively associated -> buying one decreases the
likelihood of buying the other

• We need to measure the "real strength" of a rule
- Correlation analysis

A => B [support, confidence, correlation]

1. lift:

$$\text{lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$$

- A and B are independent if P(A ∪ B) = P(A)P(B), that is, if lift = 1
- Otherwise, dependent and correlated occurrence
- If lift < 1, A is negatively correlated with B
- If lift > 1, A is positively correlated with B -> A's occurrence "lifts" the
  occurrence of B

2. χ² -> already discussed in a previous lecture
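Since confidence(A => B) = P(A ∪ B) / P(A), lift can also be written as confidence(A => B) / P(B), which lets the computer games and videos rule be checked using only the figures quoted above. A minimal sketch; the function names and the tolerance `tol` are assumptions of this sketch.

```python
def lift(p_a_and_b, p_a, p_b):
    """lift(A, B) = P(A u B) / (P(A) * P(B))."""
    return p_a_and_b / (p_a * p_b)


def lift_from_confidence(confidence, p_b):
    """Equivalent form: confidence(A => B) / P(B)."""
    return confidence / p_b


def interpret(value, tol=1e-9):
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# computer games => videos: confidence 66%, P(videos) = 75%
value = lift_from_confidence(0.66, 0.75)
print(round(value, 2), interpret(value))
```

A lift below 1 here confirms what the 75% > 66% comparison already suggested: the rule passes the support and confidence thresholds yet describes a negative association.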



